This is the final project for PSTAT 131, Intro to Statistical Machine
Learning. I chose to use the Ames Housing Data Set from Kaggle, which is
compiled by Dean De Cock for use in data science education. Since this
project is supervised machine learning and the testing data on Kaggle
doesn’t have the target variable sale_price, I will use Kaggle’s training
data as the entire data set and split it into training and testing data
later on.
The goal is to predict the sale price of houses in Ames, Iowa using
the 80 variables provided in the data set. It is a regression problem.
Let’s begin!
## [1] 1460 81
The Ames Housing data set contains 81 variables for 1,460 houses. Out of the 81 variables, sale_price is the response variable. This leaves us 80 predictor variables.
After taking a closer look at each variable, I decided to exclude the id variable since it is meaningless in the context of this project.
There are 19 variables with missing values. Three of them have over 90% of values missing, and all happen to be categorical:
1. alley is a categorical variable that describes the type of alley access to the property. It has 93.77% of values missing.
2. pool_qc is a categorical variable that describes the pool quality of houses. It has 99.52% of values missing.
3. misc_feature, with 96.3% of values missing, is a categorical variable that describes miscellaneous features not covered in other categories.
From the data description on Kaggle, the ‘NA’ level in the above 3 variables simply means “does not have”. Since most of the houses in the data set don’t have alleys, pools, or miscellaneous features, I will exclude these variables.
By the same logic, the misc_val and pool_area variables will be dropped as well.
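The drop described above can be sketched with a small helper (this is a hedged illustration, not the exact code used here; the toy data frame stands in for the Ames set, where the cutoff would be 0.90):

```r
# Drop columns whose share of missing values exceeds a cutoff.
drop_mostly_missing <- function(df, cutoff = 0.90) {
  na_share <- colMeans(is.na(df))            # fraction of NA per column
  df[, na_share <= cutoff, drop = FALSE]
}

# Toy stand-in for Ames: pool_qc is 80% missing here
toy <- data.frame(
  pool_qc  = c(NA, NA, NA, NA, "Ex"),
  lot_area = c(8450, 9600, 11250, 9550, 14260)
)
names(drop_mostly_missing(toy, cutoff = 0.5))  # "lot_area"
```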
In this section, I will examine the relationships within each of
my predictor variables as well as their relationship with the response
variable - ‘sale_price’.
Since we are trying to predict the sale price, it would be
helpful to look at its distribution.
Ames %>%
ggplot(aes(x = sale_price)) +
geom_histogram(bins = 50) +
geom_vline(xintercept = min(Ames$sale_price),color='red',linetype='dashed')+
scale_x_continuous(labels=dollar_format())+
labs(title = "Distribution of Sale Price")+
theme_bw()
From the histogram, the distribution of sale_price is right-skewed, and I will
perform log transformation on it in the model preparation step.
Additionally, the bars on the further right of the histogram suggest
there exist a few houses that were sold at extremely high prices. They
are likely to be the same houses identified in the previous step - the
ones with pools.
Let’s visualize a correlation matrix of the numeric variables
and also list out the correlation coefficients in descending order:
correlations <- Ames %>%
select(where(is.numeric)) %>%
cor(use = "pairwise.complete.obs")%>% # handling missing data
corrplot(type = "upper", diag = FALSE, method = "circle")

correlations <- Ames %>%
select(where(is.numeric)) %>%
cor(use = "pairwise.complete.obs")
cor_with_sale_price <- correlations %>%
as.data.frame() %>%
rownames_to_column(var = "variable") %>%
select(variable, sale_price) %>%
arrange(desc(sale_price))
print(cor_with_sale_price)

## variable sale_price
## 1 sale_price 1.00000000
## 2 overall_qual 0.79098160
## 3 gr_liv_area 0.70862448
## 4 garage_cars 0.64040920
## 5 garage_area 0.62343144
## 6 total_bsmt_sf 0.61358055
## 7 x1st_flr_sf 0.60585218
## 8 full_bath 0.56066376
## 9 tot_rms_abv_grd 0.53372316
## 10 year_built 0.52289733
## 11 year_remod_add 0.50710097
## 12 garage_yr_blt 0.48636168
## 13 mas_vnr_area 0.47749305
## 14 fireplaces 0.46692884
## 15 bsmt_fin_sf1 0.38641981
## 16 lot_frontage 0.35179910
## 17 wood_deck_sf 0.32441344
## 18 x2nd_flr_sf 0.31933380
## 19 open_porch_sf 0.31585623
## 20 half_bath 0.28410768
## 21 lot_area 0.26384335
## 22 bsmt_full_bath 0.22712223
## 23 bsmt_unf_sf 0.21447911
## 24 bedroom_abv_gr 0.16821315
## 25 screen_porch 0.11144657
## 26 mo_sold 0.04643225
## 27 x3ssn_porch 0.04458367
## 28 bsmt_fin_sf2 -0.01137812
## 29 bsmt_half_bath -0.01684415
## 30 low_qual_fin_sf -0.02560613
## 31 yr_sold -0.02892259
## 32 overall_cond -0.07785589
## 33 ms_sub_class -0.08428414
## 34 enclosed_porch -0.12857796
## 35 kitchen_abv_gr -0.13590737
There are a few takeaways from the plot and table above:
The highly positive correlation between sale_price and overall_qual (0.79) suggests that a linear regression model focusing on overall_qual in the recipe might be very effective in predicting sale_price. Intuitively it also makes sense: people are willing to pay more for higher-quality housing.
Multicollinearity exists among a few variables. For the pairs with coefficients over 0.5, I will drop the one with the lower correlation coefficient with sale_price. Between:
(1) x1st_flr_sf and total_bsmt_sf, dropping x1st_flr_sf
(2) tot_rms_abv_grd and gr_liv_area, dropping tot_rms_abv_grd
(3) garage_yr_blt and year_built, dropping garage_yr_blt
(4) garage_cars and garage_area, dropping garage_area
Ames <- subset(Ames, select = -c(
  x1st_flr_sf, tot_rms_abv_grd, garage_yr_blt, garage_area, ms_sub_class,
  lot_frontage, lot_area, overall_cond, bsmt_fin_sf1, bsmt_fin_sf2,
  bsmt_unf_sf, x2nd_flr_sf, low_qual_fin_sf, bsmt_full_bath, bsmt_half_bath,
  half_bath, bedroom_abv_gr, kitchen_abv_gr, wood_deck_sf, open_porch_sf,
  enclosed_porch, x3ssn_porch, screen_porch, mo_sold, yr_sold, year_remod_add))
After dropping some irrelevant variables, the correlation plot looks
much clearer.
correlations <- Ames %>%
select(where(is.numeric)) %>%
cor(use = "pairwise.complete.obs")%>% # handling missing data
corrplot(type = "upper", diag = FALSE, method = "circle", addCoef.col = 1)

I will closely analyze the distribution and significance of the categorical variables here.
Since we have over 1,000 observations, I will assume the population is
normally distributed. I will use ANOVA at the 1% significance level on the
categorical variables to test, for each one, whether the mean of
sale_price differs across its levels.
categorical_variables <- names(Ames)[!sapply(Ames, is.numeric)]
p_values <- sapply(categorical_variables, function(var) {
anova_result <- aov(as.formula(paste("sale_price ~", var)), data=Ames)
p_value <- summary(anova_result)[[1]][["Pr(>F)"]][1]
return(p_value)
})
sorted_vars <- names(p_values)[order(p_values)]
significant_vars <- sorted_vars[p_values[sorted_vars] < 0.01]
print(significant_vars)

## [1] "neighborhood" "exter_qual" "kitchen_qual" "bsmt_qual"
## [5] "garage_finish" "foundation" "heating_qc" "garage_type"
## [9] "mas_vnr_type" "bsmt_fin_type1" "sale_condition" "exterior1st"
## [13] "exterior2nd" "bsmt_exposure" "sale_type" "ms_zoning"
## [17] "house_style" "lot_shape" "central_air" "fireplace_qu"
## [21] "electrical" "paved_drive" "roof_style" "bldg_type"
## [25] "bsmt_cond" "land_contour" "roof_matl" "condition1"
## [29] "garage_qual" "garage_cond" "exter_cond" "lot_config"
## [33] "functional" "heating" "fence"
We have 40 categorical variables and 35 of them are significant.
However, I still need to narrow it down. Some of them have many
levels, and if I applied One-Hot Encoding to all of them, there would be
too many columns in the recipe.
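The column-count concern can be checked directly by counting distinct levels per categorical variable, since one-hot encoding creates roughly one dummy column per level. A hedged sketch (the helper is mine; the toy data frame stands in for Ames):

```r
# Count distinct non-NA levels of every categorical column, largest first.
count_levels <- function(df) {
  cats <- df[sapply(df, function(x) is.character(x) || is.factor(x))]
  sort(sapply(cats, function(x) length(unique(x[!is.na(x)]))),
       decreasing = TRUE)
}

toy <- data.frame(
  neighborhood = c("NAmes", "OldTown", "CollgCr", "NAmes"),
  central_air  = c("Y", "Y", "N", "Y"),
  sale_price   = c(2.1e5, 1.3e5, 1.9e5, 1.4e5)
)
count_levels(toy)  # neighborhood: 3, central_air: 2
```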
We will explore the top 4 categorical variables with the smallest
p-values: neighborhood, exter_qual, bsmt_qual, and kitchen_qual.
visualize_cat_feature <- function(data, feature, color) {
data[[feature]] <- with(data, reorder(data[[feature]], sale_price, median))
ggplot(data, aes(x = .data[[feature]], y = sale_price)) +
geom_boxplot(fill = color) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = paste('Sale Price vs.', feature),
x = feature,
y = 'Sale Price')
}
p241 <- visualize_cat_feature(Ames, "neighborhood", "green")
p242 <- visualize_cat_feature(Ames, "exter_qual", "blue")
p243 <- visualize_cat_feature(Ames, "bsmt_qual", "red")
p244 <- visualize_cat_feature(Ames, "kitchen_qual", "purple")
grid.arrange(p241, p242, p243, p244, ncol=2)

Among different neighborhoods in Ames, the median house prices vary drastically. Nonetheless, I believe 25 different neighborhoods are too many and may cause overfitting. Therefore, I need to recode the neighborhoods into regions based on the geography of Ames, Iowa.
The quality of the exterior material also affects the sale prices. When a house has excellent-quality materials, its sale price is likely to be about double that of houses with lower-quality materials.
bsmt_qual evaluates the height of the basement, and from our boxplot, houses with excellent basement quality (100+ inches in height) are the most expensive. The other levels of basement height have less effect on the sale price.
Higher kitchen quality also results in higher prices.
At this point, I have gained a sense that this data set contains lots
of measurements on multiple aspects of quality of houses. In the steps
below, I will explore their distribution across levels to see if they
are worth keeping or should be represented by the single variable overall_qual.
Now we have a clearer understanding of some of the issues within the
Ames Housing data set, we are going to modify the data set in this step
to prepare it for our models.
For numerical variables highly correlated with sale_price, I want to
use the z-scores of each variable to detect and remove outliers. Any
data point more than 3 standard deviations from the mean is considered
an outlier. For gr_liv_area and total_bsmt_sf especially, the outliers
in the bottom right corners of the scatterplots are very noticeable.
#plots before removing outliers
plot_feature_vs_sale_price <- function(data, feature) {
ggplot(data, aes(x = .data[[feature]], y = sale_price)) +
geom_point(alpha = 0.5, size = 0.5) +
theme_minimal() +
labs(title = paste('Sale Price vs.', feature),
x = feature,
y = 'Sale Price')
}
p1 <- plot_feature_vs_sale_price(Ames, "overall_qual")
p2 <- plot_feature_vs_sale_price(Ames, "gr_liv_area")
p3 <- plot_feature_vs_sale_price(Ames, "garage_cars")
p4 <- plot_feature_vs_sale_price(Ames, "full_bath")
p5 <- plot_feature_vs_sale_price(Ames, "total_bsmt_sf")
p6 <- plot_feature_vs_sale_price(Ames, "year_built")
grid.arrange(p1, p2, p3, p4, p5, p6, ncol=3)

features <- c('overall_qual', 'gr_liv_area', 'garage_cars', 'total_bsmt_sf',
'full_bath', 'year_built')
z_scores <- scale(Ames[, features])
outlier_rows <- apply(z_scores, 1, function(row) any(abs(row) > 3))
Ames <- Ames[!outlier_rows, ]
p7 <- plot_feature_vs_sale_price(Ames, "overall_qual")
p8 <- plot_feature_vs_sale_price(Ames, "gr_liv_area")
p9 <- plot_feature_vs_sale_price(Ames, "garage_cars")
p10 <- plot_feature_vs_sale_price(Ames, "full_bath")
p11 <- plot_feature_vs_sale_price(Ames, "total_bsmt_sf")
p12 <- plot_feature_vs_sale_price(Ames, "year_built")
grid.arrange(p7, p8, p9, p10, p11, p12, ncol=3)
Now we can no longer see outliers in our scatterplots.
We have 40 categorical variables, and many of them have uneven
distributions across different levels.
c1<-Ames %>% ggplot(aes(x = ms_zoning)) + geom_bar()
c2<-Ames %>% ggplot(aes(x = lot_shape)) + geom_bar()
c3<-Ames %>% ggplot(aes(x = lot_config)) + geom_bar()
c4<-Ames %>% ggplot(aes(x = house_style)) + geom_bar()
c5<-Ames %>% ggplot(aes(x = roof_style)) + geom_bar()
c6<-Ames %>% ggplot(aes(x = exterior1st)) + geom_bar()
c7<-Ames %>% ggplot(aes(x = exterior2nd)) + geom_bar()
c8<-Ames %>% ggplot(aes(x = exter_qual)) + geom_bar()
c9<-Ames %>% ggplot(aes(x = foundation)) + geom_bar()
c10<-Ames %>% ggplot(aes(x = bsmt_qual)) + geom_bar()
c11<-Ames %>% ggplot(aes(x = heating_qc)) + geom_bar()
c12<-Ames %>% ggplot(aes(x = kitchen_qual)) + geom_bar()
c13<-Ames %>% ggplot(aes(x = garage_type)) + geom_bar()
c14<-Ames %>% ggplot(aes(x = sale_condition)) + geom_bar()
grid.arrange(c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14, nrow = 3)
I will combine the observations in the sparse levels into one level
named “other”.
factor_cols <- sapply(Ames, is.factor)
Ames$ms_zoning <- Ames$ms_zoning %>% fct_lump_lowfreq()
Ames$lot_shape <- Ames$lot_shape %>% fct_lump_lowfreq()
Ames$lot_config <- Ames$lot_config %>% fct_lump_n(2)
Ames$house_style<- Ames$house_style %>% fct_lump_n(3)
Ames$roof_style <- Ames$roof_style %>% fct_lump_lowfreq()
Ames$exterior1st<- Ames$exterior1st %>% fct_lump_n(5)
Ames$exterior2nd<- Ames$exterior2nd %>% fct_lump_n(5)
Ames$exter_qual <- Ames$exter_qual %>% fct_lump_lowfreq()
Ames$foundation<- Ames$foundation %>% fct_lump_n(2)
Ames$heating_qc<- Ames$heating_qc %>% fct_lump_n(3)
Ames$kitchen_qual<- Ames$kitchen_qual%>% fct_lump_n(2)
Ames$garage_type<- Ames$garage_type%>% fct_lump_n(2)
Ames$sale_condition <- Ames$sale_condition %>% fct_lump_lowfreq()
c11<-Ames %>% ggplot(aes(x = ms_zoning)) + geom_bar()
c22<-Ames %>% ggplot(aes(x = lot_shape)) + geom_bar()
c33<-Ames %>% ggplot(aes(x = lot_config)) + geom_bar()
c44<-Ames %>% ggplot(aes(x = house_style)) + geom_bar()
c55<-Ames %>% ggplot(aes(x = roof_style)) + geom_bar()
c66<-Ames %>% ggplot(aes(x = exterior1st)) + geom_bar()
c77<-Ames %>% ggplot(aes(x = exterior2nd)) + geom_bar()
c88<-Ames %>% ggplot(aes(x = exter_qual)) + geom_bar()
c99<-Ames %>% ggplot(aes(x = foundation)) + geom_bar()
c1010<-Ames %>% ggplot(aes(x = bsmt_qual)) + geom_bar()
c1111<-Ames %>% ggplot(aes(x = heating_qc)) + geom_bar()
c1212<-Ames %>% ggplot(aes(x = kitchen_qual)) + geom_bar()
c1313<-Ames %>% ggplot(aes(x = garage_type)) + geom_bar()
c1414<-Ames %>% ggplot(aes(x = sale_condition)) + geom_bar()
grid.arrange(c11,c22,c33,c44,c55,c66,c77,c88,c99,c1010,c1111,c1212,c1313,c1414, nrow = 3)
For categorical variables with multiple levels, we will drop
variables with more than 85% of observations in one level. These include:
street land_contour utilities land_slope condition1 condition2 bldg_type
roof_matl exter_cond bsmt_cond bsmt_fin_type2 heating central_air
electrical functional garage_qual garage_cond paved_drive sale_type
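The 85% rule above can be sketched with a one-line helper (a hedged illustration, not the exact code used here; the toy data frame stands in for Ames):

```r
# Share of observations falling in the single most frequent level.
dominant_share <- function(x) max(table(x)) / length(x[!is.na(x)])

toy <- data.frame(
  street    = c(rep("Pave", 19), "Grvl"),       # 95% in one level -> drop
  lot_shape = c(rep("Reg", 12), rep("IR1", 8))  # 60% in one level -> keep
)
sapply(toy, dominant_share)  # street 0.95, lot_shape 0.60
```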
## mas_vnr_type mas_vnr_area bsmt_qual bsmt_exposure bsmt_fin_type1
## 8 8 36 37 36
## fireplace_qu garage_type garage_finish fence
## 683 76 76 1157
Missing values are handled as follows:
1. According to Kaggle's data description, the "NA" level in some categorical variables means "does not have". Therefore, I will replace the "NA" with "None".
2. The numerical variable 'mas_vnr_area' will be imputed with its mode.
#categorical variables
cate_to_replace <- c( 'mas_vnr_type','bsmt_qual', 'bsmt_exposure', 'bsmt_fin_type1',
'fireplace_qu', 'garage_type', 'garage_finish', 'fence')
Ames[cate_to_replace] <- lapply(Ames[cate_to_replace], function(col) {
ifelse(is.na(col), 'None', col)
})
#numerical variables
mode_value <- as.numeric(names(sort(table(Ames$mas_vnr_area), decreasing=TRUE)[1]))
Ames$mas_vnr_area[is.na(Ames$mas_vnr_area)] <- mode_value
#categorical to factor
Ames[sapply(Ames, is.character)] <- lapply(Ames[sapply(Ames, is.character)], as.factor)

## named integer(0)
All missing values are handled.
library(forcats)
Ames %>%
ggplot(aes(y = forcats::fct_infreq(neighborhood))) +
geom_bar(fill='#003399') +
theme_base() +
ylab("Neighborhood")
The neighborhood variable has 25 levels with a fair number of
observations in each, so we will recode it into regions according
to the map of Ames, Iowa.
Ames <- Ames%>%
mutate(region = forcats::fct_collapse(neighborhood,
west = c("SawyerW", "Sawyer", "ClearCr", "Edwards",
"CollgCr","Crawfor", "SWISU", "Blueste",'Timber'),
north = c("NridgHt", "NoRidge", "Gilbert", "NWAmes",
"Somerst", "Veenker", "Blmngtn", "BrkSide",
"NPkVill",'BrDale','StoneBr','NAmes','OldTown',
'IDOTRR'),
southeast = c("Mitchel", "MeadowV"))) %>%
select(-neighborhood)
Ames %>%
ggplot(aes(y = forcats::fct_infreq(region))) +
geom_bar(fill='#003399') +
theme_base() +
ylab("Region")
Now that I have recoded the neighborhood variable, I want to see if my
previous take, that 3-car-garage houses are the most expensive because
of their location, is reasonable.
Ames %>%
ggplot(aes(x = as.factor(region), y = sale_price)) +
geom_boxplot(aes(fill = as.factor(garage_cars))) +
theme_minimal() +
labs(x = "Regions", y = "Sale Price", fill = "garage_cars", title = "Sale Price by Garage Capacity and Region") +
theme(legend.position = "bottom")
In the boxplot, houses that are in the north region and have a 3-car
garage are the most expensive of all houses in Ames, Iowa. Across garage
capacities, north-region houses are generally as expensive as, if not
more expensive than, houses in the other regions, and the maximum prices
are also higher in the north region.
Ames %>% ggplot(aes(y = sale_price, fill = region,
x = factor(ms_zoning))) +
stat_summary(fun = "mean", geom = "bar",
position = "dodge") + theme_bw()
MSZoning: Identifies the general zoning classification of the
sale.
A Agriculture
C Commercial
FV Floating Village Residential
I Industrial
RH Residential High Density
RL Residential Low Density
RP Residential Low Density Park
RM Residential Medium Density
The north region mostly consists of commercial and floating village
residential zones, which explains why this region has overall higher
sale prices than the others. The southeast region is mainly low- and
medium-density residential zoning. Having a simple occupant type
explains the small range of house prices in the southeast region.
After I prepared the data and gained understanding of the variables in
the Ames Housing data set, I will set up the recipe and create folds for
cross validation for my models.
I use a log transformation on the response variable, sale_price, because
I noticed its skewness in the histogram earlier. Additionally, sale
prices are mostly very large numbers, and this will affect the
evaluation of our model performance.
As for the splitting, I am choosing a 70/30 split because, as I mentioned
in the introduction, this data set was originally the training data set
for the Kaggle competition. Therefore, 30% of the 1,432 observations
should be adequate for testing the models. Lastly, to ensure an even
distribution of the response variable in the training and testing data,
I stratified the split on sale_price as well.
## <Training/Testing/Total>
## <1001/431/1432>
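The split summary above comes from stratified sampling on the response. As a self-contained illustration of what stratifying does (the helper, its bin count, and the simulated response are mine, not the project's code, which would use rsample::initial_split with strata = sale_price):

```r
# Hedged base-R sketch of a 70/30 split stratified on a skewed response:
# bin the response into quantile groups, then sample 70% within each bin,
# so all price ranges are represented in both halves.
set.seed(131)
stratified_split <- function(y, prop = 0.70, n_bins = 4) {
  bins <- cut(y,
              breaks = quantile(y, probs = seq(0, 1, length.out = n_bins + 1)),
              include.lowest = TRUE)
  train_idx <- unlist(lapply(split(seq_along(y), bins),
                             function(i) sample(i, floor(prop * length(i)))))
  sort(train_idx)
}

y <- rlnorm(1000, meanlog = 12, sdlog = 0.4)  # skewed, price-like response
train_idx <- stratified_split(y)
length(train_idx) / 1000                      # close to 0.70
```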
I will use a simple recipe for the linear regression model because the
correlation coefficients with sale_price of its numerical predictors are
very high. For the other models, I will use one shared recipe. The
categorical data will be dummy-coded in the recipe creation step. I will
also standardize all predictors by centering and scaling them, because
the numerical ones have different ranges and I want to evaluate the
model performances easily.
ames_recipe <- recipe(sale_price ~ ., data = ames_train) %>%
#dummy coding nominal variables
step_dummy(all_nominal_predictors())%>%
#step_zv(all_predictors()) %>%
#normalize
step_center(all_predictors()) %>%
step_scale(all_predictors())
#prep and bake
prep(ames_recipe) %>%
bake(new_data = ames_train) %>%
head()

## # A tibble: 6 × 70
## overall_qual year_built mas_vnr_area total_bsmt_sf gr_liv_area full_bath
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.704 -1.37 -0.590 -0.214 0.609 0.846
## 2 -0.781 -1.10 -0.590 -0.113 -0.878 -1.03
## 3 -0.781 -0.210 -0.590 0.0143 -0.957 -1.03
## 4 -1.52 -0.142 -0.590 -2.68 -0.411 0.846
## 5 -1.52 -1.50 -0.590 -1.33 -2.07 -1.03
## 6 -1.52 -1.74 -0.590 -0.998 -0.366 -1.03
## # ℹ 64 more variables: fireplaces <dbl>, garage_cars <dbl>, sale_price <dbl>,
## # ms_zoning_Other <dbl>, lot_shape_Other <dbl>, lot_config_Inside <dbl>,
## # lot_config_Other <dbl>, house_style_X1Story <dbl>,
## # house_style_X2Story <dbl>, house_style_Other <dbl>, roof_style_Other <dbl>,
## # exterior1st_MetalSd <dbl>, exterior1st_Plywood <dbl>,
## # exterior1st_VinylSd <dbl>, exterior1st_Wd.Sdng <dbl>,
## # exterior1st_Other <dbl>, exterior2nd_MetalSd <dbl>, …
linear_recipe <- recipe(sale_price ~ overall_qual + year_built +
total_bsmt_sf + gr_liv_area + house_style, data = ames_train) %>%
step_dummy(house_style) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors())
prep(linear_recipe) %>%
bake(new_data = ames_train) %>%
head()

## # A tibble: 6 × 8
## overall_qual year_built total_bsmt_sf gr_liv_area sale_price
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.704 -1.37 -0.214 0.609 5.11
## 2 -0.781 -1.10 -0.113 -0.878 5.07
## 3 -0.781 -0.210 0.0143 -0.957 5.11
## 4 -1.52 -0.142 -2.68 -0.411 4.95
## 5 -1.52 -1.50 -1.33 -2.07 4.84
## 6 -1.52 -1.74 -0.998 -0.366 4.60
## # ℹ 3 more variables: house_style_X1Story <dbl>, house_style_X2Story <dbl>,
## # house_style_Other <dbl>
We will use 5 models in total: Linear Regression, K-Nearest Neighbors,
Elastic Net Regression, Random Forest, and Gradient-Boosted Trees. I
decided to use elastic net regression instead of a polynomial
regression model for its ability to handle multicollinearity within my
data set. In addition, it can be inferred from the previous
visualizations that the predictor-response relationship is likely to be
linear. Polynomial regression is not a good way to capture linear trends
and was therefore excluded.
For their evaluation, I will use the RMSE (Root Mean Squared Error).
Since the RMSE measures the distance from the prediction to the actual
values, the lower the RMSE, the better the model performed.
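As a quick sanity check of the metric itself, RMSE can be computed by hand. This matches the definition yardstick uses: the square root of the mean squared difference between truth and prediction (toy numbers below, not model output):

```r
# RMSE = sqrt(mean((truth - pred)^2))
rmse_manual <- function(truth, pred) sqrt(mean((truth - pred)^2))

truth <- c(5.0, 5.2, 5.4)   # log-scale sale prices
pred  <- c(5.1, 5.2, 5.3)
rmse_manual(truth, pred)    # sqrt(0.02 / 3), about 0.0816
```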
#Linear Regression
lm_mod <- linear_reg() %>%
set_mode("regression") %>%
set_engine("lm")
#K Nearest Neighbors
knn_mod <- nearest_neighbor(neighbors = tune())%>%
set_mode("regression")%>%
set_engine("kknn")
#Elastic Net
en_reg_spec <- linear_reg(penalty = tune(),mixture = tune()) %>%
set_mode("regression")%>%
set_engine("glmnet")
#Random Forest
rf_reg_spec <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
#Gradient_Boosted Trees
bt_reg_spec <- boost_tree(mtry = tune(),
trees = tune(),
learn_rate = tune()) %>%
set_engine("xgboost") %>%
set_mode("regression")

#Linear Regression
lm_workflow <- workflow()%>%
add_model(lm_mod)%>%
add_recipe(linear_recipe)
#K Nearest Neighbors
knn_workflow <- workflow()%>%
add_model(knn_mod)%>%
add_recipe(ames_recipe)
#Elastic Net
en_workflow <- workflow()%>%
add_model(en_reg_spec)%>%
add_recipe(ames_recipe)
#Random Forest
rf_workflow <- workflow()%>%
add_model(rf_reg_spec)%>%
add_recipe(ames_recipe)
#Gradient_Boosted Trees
bt_workflow <- workflow()%>%
add_model(bt_reg_spec)%>%
add_recipe(ames_recipe)

#K Nearest Neighbors
knn_grid <- grid_regular(neighbors(range=c(1,20)),levels=10)
#Elastic Net
en_grid <- grid_regular(penalty(range = c(0, 0.1), trans = identity_trans()),
mixture(range = c(0, 1)), levels = 10)
#Random Forest
rf_grid <- grid_regular(mtry(range = c(2, 8)),
trees(range = c(200, 600)),
min_n(range = c(10, 20)),
levels = 4)
#GBT
#GBT
bt_grid <- grid_regular(mtry(range = c(2, 8)),
trees(range = c(100, 400)),
# identity_trans keeps the rates at 0.01-0.1; dials' learn_rate otherwise
# interprets the range on a log10 scale, which would give rates above 1
learn_rate(range = c(0.01, 0.1), trans = identity_trans()),
levels = 5)
Elaborate on Parameters:
K Nearest Neighbors: I chose 1 through 20 neighbors for the model to learn because a smaller number of neighbors can better capture the patterns within the data, but it will also be more sensitive to outliers, noise, etc. A large number of neighbors works the other way around.
Elastic Net: it combines the regularization of both lasso and ridge. For the penalty, I picked the range of 0 to 0.1 because larger values mean more regularization. Regularization can help prevent overfitting by penalizing large coefficients, but too much can oversimplify the model. Then for the mixture, I chose 0 to 1 so that I can compare every combination between ridge and lasso.
Random Forest: a range of 2 to 8 makes sense for mtry because it allows the model to test various subsets of predictors at each split, which might help in finding an optimal subset. For trees, 200 to 600 is a common and standard choice that ensures robust learning without reaching the point where adding more trees no longer improves the model. The minimum number of data points in a node required for the node to be split further (min_n) is 10 to 20. A smaller value would allow deeper trees, potentially capturing more intricate patterns in the data, but at the risk of overfitting; the range 10 to 20 provides a balanced approach. Levels of 4 just means there are 4^3 = 64 combinations of parameters for the model.
Gradient-Boosted Trees: most parameter ranges are similar to the random forest model. By choosing 2 to 8 for mtry, I am testing various subsets of predictors to find an optimal subset for each individual decision tree in the boosting process. The learning rate ranges from 0.01 to 0.1.
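To make the mixture knob concrete, here is a hedged sketch of the penalty term glmnet optimizes (the function name and toy coefficients are mine): mixture = 1 gives the pure lasso penalty and mixture = 0 the pure ridge penalty.

```r
# glmnet's elastic net penalty term for a coefficient vector beta:
# penalty * ( mixture * sum(|beta|) + (1 - mixture) / 2 * sum(beta^2) )
elastic_net_penalty <- function(beta, penalty, mixture) {
  penalty * (mixture * sum(abs(beta)) + (1 - mixture) / 2 * sum(beta^2))
}

beta <- c(1, -2)
elastic_net_penalty(beta, penalty = 0.1, mixture = 0)  # pure ridge: 0.25
elastic_net_penalty(beta, penalty = 0.1, mixture = 1)  # pure lasso: 0.3
```

Intermediate mixture values blend the two, which is why tuning mixture over 0 to 1 covers every combination between ridge and lasso.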
4. Tuning models and saving the results
# K Nearest Neighbors
knn_tune <- tune_grid(
knn_workflow,
resamples = ames_folds,
grid = knn_grid
)
save(knn_tune, file = "knn_tune.rda")
# Elastic Net
en_tune <- tune_grid(
en_workflow,
resamples = ames_folds,
grid = en_grid,
control = control_grid(verbose = TRUE)
)
save(en_tune, file = "en_tune.rda")
# Random Forest
rf_tune_res <- tune_grid(
rf_workflow,
resamples = ames_folds,
grid = rf_grid,
control = control_grid(verbose = TRUE)
)
save(rf_tune_res, file = "rf_tune_res.rda")
#GBT
bt_tune_res <- tune_grid(
bt_workflow,
resamples = ames_folds,
grid = bt_grid,
control = control_grid(verbose = TRUE)
)
save(bt_tune_res, file = "bt_tune_res.rda")

load("knn_tune.rda")
load("en_tune.rda")
load("rf_tune_res.rda")
load("bt_tune_res.rda")
# Linear Regression
lm_res <- lm_workflow %>% fit_resamples(ames_folds)
lm_rmse <- collect_metrics(lm_res) %>% slice(1)
# KNN
knn_rmse <- collect_metrics(knn_tune) %>% arrange(mean) %>% slice(1)
# Elastic Net
en_rmse <- collect_metrics(en_tune) %>% arrange(mean) %>% slice(1)
# Random Forest
rf_rmse <- collect_metrics(rf_tune_res) %>% arrange(mean) %>% slice(1)
# Boosted Trees
bt_rmse <- collect_metrics(bt_tune_res) %>% arrange(mean) %>% slice(1)
# Creating a tibble of all the models and their RMSE
final_compare_tibble <- tibble(
Model = c("Linear Regression", "K Nearest Neighbors", "Elastic Net", "Random Forest", "Boosted Trees"),
RMSE = c(lm_rmse$mean, knn_rmse$mean, en_rmse$mean, rf_rmse$mean, bt_rmse$mean)
)
# Arranging by lowest RMSE
final_compare_tibble <- final_compare_tibble %>%
arrange(RMSE)
final_compare_tibble

## # A tibble: 5 × 2
## Model RMSE
## <chr> <dbl>
## 1 Elastic Net 0.0618
## 2 Random Forest 0.0652
## 3 Linear Regression 0.0707
## 4 K Nearest Neighbors 0.0810
## 5 Boosted Trees 0.0890
The Elastic Net model has the lowest RMSE of 0.0618. Let’s
also visualize the RMSEs side by side.
# Creating a data frame of the model RMSE's
all_models <- data.frame(
Model = c("Linear Regression", "K Nearest Neighbors", "Elastic Net", "Random Forest", "Boosted Trees"),
RMSE = c(lm_rmse$mean, knn_rmse$mean, en_rmse$mean, rf_rmse$mean, bt_rmse$mean)
)
# Plotting the RMSE values using a bar chart
ggplot(all_models, aes(x=Model, y=RMSE)) +
geom_bar(stat = "identity", aes(fill = Model)) +
scale_fill_manual(values = c("lightblue", "lightcoral", "lightsalmon", "lightgreen", "lightpink")) +
theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Comparing RMSE by Model")
For the elastic net model, the scale of the y-axis for both metrics is
relatively small. This indicates that the resulting performance doesn’t
really vary drastically across any of the models we’ve fit. The amount
of regularization, on the x-axis, is the penalty hyperparameter, which
covers the range of values we specified (0 to 0.1), and the values of
mixture are represented by the different-colored lines. Smaller penalty
and mixture produce better RMSE across the 10 folds. As the penalty
increases, the model does worse because the coefficients of the
predictors are being reduced to very small values (due to the penalty),
which makes it much more difficult for the model to predict.
Overall, the models with a zero percentage of lasso penalty, i.e. the
ridge regression models, do better, as indicated by their line sitting
consistently lower than the others. This implies that avoiding reducing
predictors all the way down to zero, as can happen in the case of lasso
regression, yields better performance.
For the boosted tree model, across the 5 levels, the RMSE of the model
tends to decrease as the number of randomly selected predictors
increases. The learning rate and model performance have an inverse
relationship: the larger the learning rate, the worse the model learned.
The number of trees does not make a significant impact on the model
performance. Overall, from the graph, 4-6 randomly selected predictors
work the best, especially when the learning rate is low. A high learning
rate causes the model to learn faster, but it also trains the model less
and makes it less generalized.
For the random forest model, the higher the number of randomly selected
predictors, the lower the RMSE, while the number of trees and the
minimal node size make little impact. As we increase mtry, model
performance seems to improve, with RMSE tending to decrease. The number
of trees doesn’t make much of a difference overall, as those lines are
virtually on top of each other. A minimal node size (min_n) of 10 seems
to produce slightly better results than a minimum size of 20.
Since the elastic net regression model performed the best out of the
5 models, let’s pull up the entry of parameters of the best elastic net
regression model.
## # A tibble: 1 × 8
## penalty mixture .metric .estimator mean n std_err .config
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0111 0.111 rmse standard 0.0618 10 0.00180 Preprocessor1_Model012
The low penalty and mixture suggest reduced regularization, indicating
my predictors are not as collinear as I assumed. Also, the low strength
of regularization might be due to the outlier removal in an earlier
step.
Elastic Net is our best model, and we will be fitting it to the
training and testing data.
# Fitting to the training data
best_en_train <- select_best(en_tune, metric = 'rmse')
en_final_workflow_train <- finalize_workflow(en_workflow, best_en_train)
en_final_fit_train <- fit(en_final_workflow_train, data = ames_train)
# Creating the predicted vs. actual value tibble
ames_tibble <- predict(en_final_fit_train, new_data = ames_test %>% select(-sale_price))
ames_tibble <- bind_cols(ames_tibble, ames_test %>% select(sale_price))
ames_tibble %>%
ggplot(aes(x = 10^.pred, y = 10^sale_price)) +
geom_point(alpha = 0.4) +
geom_abline(lty = 2) +
theme_grey() +
coord_obs_pred() +
labs(title = "Predicted Values vs. Actual Values")
Most points are on the line y = x, suggesting the elastic net regression
model did well in predicting the majority of sale prices in the testing
data. The few outliers in the top right corner show that the model
underestimates the sale prices of more expensive houses. My explanation
is that there wasn’t enough data on high-end houses to train the models
on.
# make predictions for the testing data and take a look at its testing RMSE:
best_en_reg <- select_best(en_tune,metric = 'rmse')
final_en_model <- finalize_workflow(en_workflow, best_en_reg)
final_en_model <- fit(final_en_model, ames_train)
final_en_model_test <- augment(final_en_model, ames_test)
rmse(final_en_model_test, truth = sale_price, .pred)

## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 0.0590
A VIP plot further examines the important predictors.
This plot tells us that the two most useful predictors of sale price in
this elastic net model are gr_liv_area and overall_qual.
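The code producing the VIP plot isn't shown above; as a hedged illustration of the idea behind it for linear models like the elastic net, importance is typically the absolute value of the standardized coefficients (which is what vip reports for glmnet fits). A toy base-R version with simulated data and made-up predictor names:

```r
# Simulate standardized predictors where gr_liv_area truly matters most,
# fit a linear model, and rank predictors by |coefficient|.
set.seed(1)
X <- scale(matrix(rnorm(300), ncol = 3,
                  dimnames = list(NULL, c("gr_liv_area", "overall_qual",
                                          "year_built"))))
y <- 2 * X[, "gr_liv_area"] + 0.5 * X[, "overall_qual"] + rnorm(100, sd = 0.1)

fit <- lm(y ~ X)
importance <- sort(abs(coef(fit)[-1]), decreasing = TRUE)  # drop intercept
names(importance)[1]  # the strongest predictor, here gr_liv_area
```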
After modeling and analysis of the Ames Housing data set, the Elastic Net model emerged as the most effective in predicting house prices. This was anticipated since Elastic Net combines the strengths of both Ridge and Lasso regression, allowing it to handle multicollinearity efficiently while also performing feature selection. Despite its strengths, the model’s predictions, as indicated by the RMSE, suggest there’s room for improvement.
On the other end of the spectrum, the K-Nearest Neighbors (KNN) model performed the least effectively. The challenge with KNN, especially in a dataset like ours with numerous predictors, is the curse of dimensionality. As dimensions increase, data points tend to drift apart, making it harder for KNN to find close neighbors, thus affecting prediction accuracy.
As someone who has always been interested in the real estate market, this experience provided me a way to practice my machine learning techniques on prepared data. In the future, I want to go deeper into the data and possibly explore other variables not present in the current dataset that might impact house prices. For instance, understanding more about a neighborhood’s amenities, such as proximity to schools, parks, or public transport, could provide further insights. Another intriguing aspect to consider might be the historical significance of some houses or whether they’re located in a heritage precinct, as these factors can significantly influence price.
Nonetheless, a lingering thought concerns the decision to transform certain variables and the choice of polynomial terms for some predictors. Experimenting with different transformations or adding interactions could potentially enhance the model’s performance.
If I were to further this research, I would also be keen to compare the Ames housing dataset with housing data from other regions or cities. It would be captivating to discern if house pricing trends and influential factors remain consistent across different geographical locales.
All in all, this journey into predicting house prices with the Ames data set has been both challenging and enlightening. It’s been a rigorous exercise in refining machine learning skills and diving deep into data analysis. In particular, I became aware of the importance of avoiding train-test leakage, and I see areas where my skills improved through this project. While the Elastic Net model may not be flawless, it’s rewarding to have developed a model that provides a reasonable understanding of the factors influencing sale prices.